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Abstract — Over the last decade several positive definite kernels 
have been proposed to treat spike trains as objects in Hilbert 
space. However, for the most part, such attempts still remain a 
mere curiosity for both computational neuroscientists and signal 
processing experts. This tutorial illustrates why kernel methods 
can, and have already started to, change the way spike trains 
are analyzed and processed. The presentation incorporates simple 
mathematical analogies and convincing practical examples in an 
attempt to show the yet unexplored potential of positive definite 
functions to quantify point processes. It also provides a detailed 
overview of the current state of the art and future challenges 
with the hope of engaging the readers in active participation. 



I. Introduction 

Information processing in the brain is carried out by a com- 
plex network of neurons communicating by sending reliable 
stereotypical electrical pulses known as action potentials, or 
spikes. Thus, the information is encoded in a sequence of 
events over continuous time, and not in the amplitude of the 
signal as is common in signal processing applications (see 
Fig. [T]l. Studying how information is represented and pro- 
cessed as spike trains — known as the neural coding problem — 
is one of the key challenges of neuroscience. We venture 
to say that the theory of how to represent information in 
continuous, infinite dimensional spaces is also far from being 
understood in the signal processing and machine learning 
communities. In light of the current signal processing focus in 
sparseness, point processes (that generate spike trains) are very 
appealing, since a point process provides the natural limiting 
case of sparse priors that underlie compressive sensing, and it 
implements the ultimate sparse representation: the system only 
communicates when the information crosses some internal 
threshold. This strategy saves power, and provides naturally 
a sparse representation in time, so the costly step of finding 
alternative spaces to map the input data for sparseness is 
unnecessary. The problem is that the system becomes less 
observable, and therefore algorithms intended to predict, con- 
trol or otherwise process the incoming information are less 
effective and much more cumbersome. The early attempts in 
the engineering literature to apply stochastic process theory 
to zero crossing analysis (a simple way to create a point 
process) started in the 40's with Rice at the Bell Labs, and 
found applications in frequency modulation (FM) and shot 
noise. The theory of point processes developed primarily in 
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the Statistics literature [1] and currently this theory is the most 
widely used approach to quantify spike trains in computational 
neuroscience as well as in all other areas of science and 
engineering. Point processes are also important for machine 
learning, in particular for online learning that deals with data 
streams, because of the shortcomings of vector spaces to 
represent both unbounded data and the resulting inference 
structure obtained after processing. 
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Fig. 1. Spike train observation. Typical intracellular membrane potential 
recording from a neuron is shown as a continuous time trace. Each occurrence 
of an action potential is the only information directly communicated to other 
neurons through synaptic connections. Spike train is represented as a sequence 
of times {ti, . . . ,tn) that action potentials are detected. 

Since the spike train space is devoid of an algebra, it 
imposes many challenges to signal processing methods. We 
must then first establish a space for computation or a transfor- 
mation to a space with the necessary properties. The approach 
explained here is to define a proper kernel function on spike 
trains to capture nonparametrically the temporal structure and 
the variability of the spike trains of interest. Once a positive 
definite kernel is defined, it maps the spike trains into a Hilbert 
space of functions which allows signal processing tools to be 
applied directly through the kernel trick. This methodology 
has the potential to enlarge the footprint of digital signal pro- 
cessing to objects that are non-numeric, i.e., we can filter spike 
trains, decompose them in principal components, and perform 
inference, with exactly the same tools available for time series 
defined in M. But more importantly, the use of kernels provides 
an opportunity for a myriad of advanced machine learning 
tools such as Gaussian processes, and probability embedding 
to be applied to spike trains, opening up a new frontier for 
next generation spike train signal processing. 



A. Neuroscience and neural engineering problems 

The idea of a neural code is prevalent in the sensory and 
motor systems where the variables of interest are directly 
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Fig. 2. Illustration of the neural decoding problem in the sensory and 
motor system. Note that decoding is causal for behavior, and anti-causal for 
sensation and perception. 



observable, although it is latent in all neuronal communication. 
In a sensory system, we would often like to understand how 
the sensory stimuli are encoded and transformed in each stage 
of neural processing. For example, visual stimuli excite a 
cascade of neurons in the visual pathway from photoreceptor 
neurons, and retinal ganglion cells in the eye to various areas 
of the visual cortex. By analyzing the spike trains, we can 
understand how certain aspects of a stimulus are processed 
and represented in these neurons |I2J. The study of neural code 
often consists of 

1) identifying neurons that encode certain features of inter- 
est (neuron identification), and 

2) finding the functional relation between the feature and 
spike trains of the identified neurons (neural encod- 
ing/decoding). 

A major challenge in neuron identification is the neural 
variability, or "noise", in the system. For example, when a 
fixed stimulus is repeatedly presented to the system, the trial- 
to-trial variability of the neural response is often complex 
(see Fig. |3]l. Therefore, to determine if a neuron encodes 
for the variable of interest x, one cannot compare simply 
single trial responses, but requires collections of responses 
that are samples from the stimulus-conditional distribution 
p( spike train I a;). Fortunately, the tools from signal detection 
theory and hypothesis testing can be extended to stochastic 
spike trains via kernels to solve the neural identification 



problem (section III-Ai. 

On the other hand, a better method for reading out the 
stimulus from the spike trains (neural decoding) can have 
major impact on a number of clinical and biomedical ap- 
plications. For example, it can improve sensory prosthetics 
such as cochlear implants, which are widely used, and retinal 
and tactile prosthetics that are under active development [[3l, 
im, [S). In motor systems, identifying which neurons are 
involved in motor planning and control, and understanding 
how information is represented in spike trains is essential in 
building motor prosthetics 161, Q. Similar approaches have 



been taken for various higher-level cognitive systems such as 
decision making, memory, and language fS^. 

Spike train kernels provide alternative tools to the neural 
coding problem (Fig. [2|i. Traditionally, the "rate code" has 
been the dominant idea in neuroscience |^, fTOl, and it has 
been repeatedly demonstrated experimentally. The rate code 
hypothesis states that the average spike count encodes all 
information underlying the stimulus, i.e., that the spike timing 
is not useful for neural processing. Contrary to the rate code 
hypothesis, there is also ample evidence for the so-called 
"temporal code" hypothesis which states that extra information 
is encoded in spike timings ifTTl . llT2l . |fT3| . The neuroscience 
community, however, has largely relegated the possibility of 
a temporal code to a secondary role perhaps due to the large 
dimensionality of the neural code space and the limited ability 
of statistical methods that directly operate on spike trains and 
are powerful enough to discover new patterns. If the brain 
processes and communicates sensory data optimally amongst 
neurons, one natural solution is to utilize a representation that 
preserves as much of the information as possible |14|. Along 
this line of reasoning, the timing hypothesis should be the 
preferred theory because it is the one that guarantees no loss 
of information and it solves the conundrum: in cases where 
it is impossible to use rates (when the response time has to 
be minimized), spike times are preferred, but a representation 
that is sensitive to spike times also can easily represent rates 
by integration. Practically the argument between rates and 
timing is also biased by the degree of sophistication of the 
mathematical tools required: it is difficult to quantify spike 
timings, while it is very easy to process rates, therefore there 
may have been many experimental observations that corrob- 
orate the spike timing hypothesis that were never published 
because researchers could not quantify appropriately their data. 
Spike train kernels shine a new light into this controversy by 
providing a general framework for studying spike trains that 
can accommodate both hypotheses. We hope that, by focusing 
on what is common, the spike train kernel approach may kindle 
experimental research to show that different neuron classes 
are optimized for different time scales of processing, just like 
engineers design differently transistors for high speed CMOS 
and sample and hold circuits. 

B. Kernels and kernel methods 

The practicality of signal processing is due to a clever 
exploitation of the linear model. Unfortunately, not all the 
problems we want to solve are well approximated by the linear 
model. The Hilbert space approach lfT6ll . and more specifically 
the reproducing kernel Hilbert space (RKHS) [17] extend the 
linear model mathematics to nonlinear modeling in the input 
space. The methodology is principled because it provides a 
general way to handle different types of nonlinearity, the 
optimization is convex, and the methodology is still practical 
in terms of computational complexity. But in our opinion, the 
true importance of kernel methods for neural signal processing 
is their ability of map abstract objects to an Hilbert space — a 
linear functional space equipped with an inner product. Indeed, 
at the core of the above mentioned problems in neuroscience 
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Fig. 3. Trial-to-trial variability. Neural response to repeated stimuli are 
variable. The variability structure can differ for different neurons and contexts. 
These trials are from retinal ganglion neurons 1151 . 



to quantify spike trains is the lack of standard algebraic 
operations such as linear projection and linear combination 
for spike trains. This mapping supplies the required structure 
for applying most signal processing tools, and also allows 
otherwise complex nonlinear computation. 

The theory of reproducing kernel Hilbert spaces provides 
a foundation for the existence of a (possibly) infinite di- 
mensional Hilbert space — a feature space — associated with 
any positive definite function of two arguments called a 
kernel ifTSl . lfT7( . Let the input space data X be an object — 
e.g., a point in E'^, a graph, or a spike train — and a kernel 
k:XxX— >Mbea real-valued bivariate function defined in 
the input space X. The input sample a; e X is mapped to the 
RKHS as the function k{x,-), therefore the kernel specifies 
the richness of the transformation. The kernel defines also 
the inner product of the Hilbert space, i.e. the kernel k{x, y) 
provides the similarity in the RKHS of the functional images 
of any two samples x and y in the input space and encapsulates 
any prior knowledge about the input space structure. Moreover, 
the inner product of two functions in the RKHS can be 
computed by a scalar kernel evaluation in the input space, 
i.e. {x\y)-u — K{x,y). This property brings computational 
simplicity, therefore we have a principled framework that 
allows nonlinear signal processing with linear algorithms and 
makes working with functions practical. 

For any finite set of points in the input space {x^}"^]^, the 
resulting matrix K, 



K = 



fi{xi,Xi) k{xi,X2) ... K{xi,Xn) 
K,{x2,Xi) k{x2,X2) ... K{x2,Xn) 

K^XmXi) K(XmX2) ■■■ l^(,XmXn\ 



(1) 



must be symmetric and positive semi-definite for a proper 
kernel k, i.e., for any real vector x e M", x^Kx > 0. Given 
data in X, the kernel matrix K represents the inner product 
between each pair in the Hilbert space. The kernel matrix plays 
a central role in kernel method algorithms, because for most 
algorithms, it contains all information required about the input. 

Let us illustrate the importance of the kernel design with 
kernel mappings on the real line where we know the feature 
space explicitly. If we map a; e M to a three-dimensional 
feature vector [1, y/2x^ x^] £ M.^, then linear regression in the 
feature space corresponds to a quadratic fit in M. Equivalently, 
this quadratic fit can be achieved by kernel least squares using 
the polynomial kernel K{x,y) = (1 + xy)^ without explicitly 
constructing the feature space \TT\. This is because the least 
squares linear regression only requires operations provided by 
the Hilbert space (linear combination, and inner product) and 
the polynomial kernel is the inner product of the feature space. 
The advantage of kernel method is avoiding the intermediate 
feature space representation especially when it is of high 
dimension. 

One popular kernel is the Gaussian (a.k.a. squared expo- 
nential) kernel K{x,y) = exp(— (a; — ?;)^/cr), which implicitly 
corresponds to an infinite dimensional feature space. It cap- 
tures the local similarity in the real line; x and a; + e are 
assumed to be very similar if |e| <C cr, and gradually becomes 
dissimilar as |e| increases. However, the choice of cr is critical. 
As the kernel size parameter a tends to zero, it approaches the 
trivial kernel, k{x, x) — 1, and k{x^ y) — for x ^ y, that still 
maps the input to an infinite dimensional space where every 
mapped input point is orthogonal to each other, and hence 
the feature space has no ability to generalize and basically 
acts as a lookup table. On the other hand, if a is larger than 
the dynamic range of the data, the Gaussian kernel provides 
basically a constant mapping of the argument, therefore the 
feature space is unable to weight distances differently and 
looses the ability to discriminate distinct points. These two 
example kernels are extremes that do not bring any advantage. 
In practice, we use a kernel (and kernel size) that is in between, 
one that provides a rich feature space with proper smoothing 
such that the practitioner can use it to nonlinearly interpolate 
between data points (e.g., in the scale of cr for the Gaussian 
kernel). 

II. Spike train kernels 

In the previous section we discussed kernel methods on the 
real line, but the beauty of the theory is that it can be applied 
to other more abstract spaces. In fact, various discrete objects 
naturally represented as a collection such as graphs, sets, and 
strings have been successfully embedded in Hilbert spaces 
through kernels jT9l, ll20l . ||2TI . opening the door for many 
traditional tools to be directly applied when a suitable kernel is 
used. The only difference with respect to the Gaussian kernel 
is that we now need a way to define the kernel in a way that is 
relevant to measuring the similarity between spike trains. Once 
this is done, we can replicate the same operations as explained 
for the Gaussian kernel on the real line, i.e., we can quantify 
similarity between spike trains and define a space where we 
can build signal processing models and perform inferences. 
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In the remainder of the section, we introduce several im- 
portant spike train kernels and discuss their advantages and 
disadvantages. 

A. Count and Binned Kernels 

A trivial way of constructing a spike train kernel is by 
first mapping the spike trains into a finite and fixed dimen- 
sional Euclidean space and using its inner product: x, y e 
M'',K(x,y) = x^y. For example, simply counting the total 
number of spikes in each spike train maps spike trains into 
natural numbers which is a subset of the real fine. The resulting 
kernel is called the count kernel. Count kernel completely 
ignores the timing, but it is useful because it encompasses the 
conventional rate coding analysis done by neuroscientists. For 
instance, the optimal least squares linear decoder is equivalent 
to the kernel least squares, and the test of mean rate difference 



Non-binned representation 



is equivalent to MMD (see section III-A i with the count kernel. 

A naive extension of the count kernel is to bin the spike train 
by choosing a sequence of consecutive time windows (Fig. |4|i. 
Neighboring time bins corresponds to different dimensions, 
and hence the kernel does not bring any smoothing across 
the bin boundaries. In the limit of fine time binning, all 
information of the continuous representation is preserved in 
the binned representation at the expense of huge dimen- 
sionality. When combined with the Euclidean inner product, 
binning in this regime is catastrophic because the inner product 
implements a look-up table, like the trivial kernel mentioned 
earlier On the other hand, when the bin size is larger, the 
temporal continuity within each time bin is respected, and 
some smoothing is provided, however, the resulting feature 
space is low dimensional, and temporal details in the spike 
trains cannot be fully represented. For some applications, there 
is a sweet spot for the bin size that may perform well, since 
it can partially extract linear relations with respect to the rate 
code hypothesis [22 1. The linear model on binned data, so 
popular in brain machine interfaces (BMIs) f6l is one example 
of this technique. 

B. Spikernel 

Although directly binning spike trains as objects in the 
Euclidean space can be misleading, a better kernel can 
be constructed using this representation but different inner 
product. The first successful kernel for neuroscience is the 
spikernel li23]| which falls in this category. It allows local 
time manipulation (time warping) enabling spike counts in 
neighboring bins to be matched, effectively smoothing over 
bin boundaries. It also weights different time bins according 
to the distance from the time of interest, which is a reasonable 
assumption based on the finite memory of neural systems. 
The spikernel has been successfully demonstrated to perform 
better than binned count kernel in the context of brain- 
machine interfaces (decoding the motor signal from neural 
activity) El . 

In general, the spikernel performs robustly Il24l . however, 
it fundamentally lacks the ability to control its temporal 
precision since it is tied to a binned representation. In ad- 
dition, it should be noted that the spikernel is computationally 
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Fig. 4. Advantages and disadvantages of tine and coarse time binning 
representation. The continuous time axis is divided into a sequence of fixed 
intervals, and tlie number of spikes occuiTing within each time bin is counted. 
Two bin sizes with different offsets are shown. Red numbers indicate changes 
in the representation due to the small temporal jitter. Small bin sizes allow 
representation of fine temporal structure, while small fluctuations causes 
radical changes if the spike trains are considered as a vector in the Euclidean 
space. Larger bin sizes smooth the time, so they are less sensitive to small 
fluctuations but detailed temporal information is lost. Hard boundaries of 
binning can be relaxed by using a smooth basis function centered at fixed 
times, such as a Gaussian or raised cosine function instead of a rectangular 
window. However, the resulting feature space still has limited dimensionality 
because of the quantization imposed on time by these techniques. 



expensive to evaluate, and it requires tuning five free param- 
eters, including bin size. The relatively large number of free 
parameters delivers a flexible kernel, which is supported by its 
performance, but tuning these parameters requires an extensive 
optimization that hinders its appeal. 



C. Linear functional kernels 

As we have seen earlier, the binning transformation is 
lossy — many spike trains can be mapped to the same binned 
representation — and similar spike trains can be mapped to 
quite different representations. How can we avoid binning 
and preserve all information and create a positive definite 
kernel? One solution is to use an infinite dimensional rep- 
resentation li25l . 

Let h he a finite energy impulse response of a linear filter 
over time (possibly non-causal), and represent the spike train 
as a sum of Dirac delta functions x{t) = ~ ^i) (non- 

binned representation, see Fig. |4|. Each spike train can be 
uniquely transformed into a function via convolution with a 
non-trivial h: 

f,{t) =x*h = J2h{t-U). 

i 

The resulting transformed spike train (t) is a function in L2 
space, which is a Hilbert space on its own. The inner product 
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Fig. 5. Linear functional kernel. (A) Two spike trains smootlied witli a 
function h{t) = e~ in red. Tlie spike timings of tlie first spike train are 
jittered to generate the second spike train, and their smoothed representation 
are similar in L2. (B) Demonstration of linearity of linear functional kernels 
using h{t) = e~^/'^ I{t > 0). The third spike train is a superposition of the 
first two spike trains, and so is the corresponding smoothed representation. A 
schematic of the three vectors in L2 explicitly indicates their linearity. 



in L2 is defined as, 

(/l.9)= j f{t)g{t)dt. 

By choosing a locally concentrated h, spike trains with sim- 
ilar small jitter are mapped to similar functions (Fig. |5]\). 
Therefore, it is continuous with respect to small temporal 
perturbations. The linear functional kernel is simply defined 
as, 

<x,v)^{f^\fy)^ j{x*h){t)[v*K)(t)dt, (2) 

that is, the inner product of the two smoothed functional 
representations in L2 (Fig. |5]l. Note that (|2]i can be rewritten 
with an explicit summation over all pairs of spikes, 

<x,y)^Y. j Ht-tf)hit-ty)dt = J29it-4) 

where g{u, v) — J h{t — u)h(t — v) dt. Therefore, the kernel 
evaluation in the infinite dimensional space can be computed 
with 0{NxNy) function evaluations of g where Nx is the 
number of spikes in the spike train x. 

A few special choices of h are worth noting Ii26l . ||25l . If 
the smoothing function has the form of a Gaussian function, 
h{t) = e-*'/2<T"^ jjjgjj g(^u,v) = e~ /'^'. More impor- 
tantly, if the smoothing filter is a causal exponential decay 



h = e for t > 0, then we obtain the following kernel: 

K(x,2/) = ^exp('-i|if-tf|y (3) 

This spike train kernel has two distinct advantages. First, 
it has a neurophysiological interpretation, since the synap- 
tic transfer function that transforms the spike trains to an 
analogue intracellular signal in the downstream neuron can 
be approximated as a first order dynamical system (i.e., first 
order IIR filter) ETl . Il28l . Second, it can be computed in 
OiiNx + Ny) \og{Nx + Ny)) time |231. The fast computation 
is due to properties of the double exponential function [29 J . 

It is easy to see that the linear functional kernels are linear 
with respect to the superposition of spike trains. If x{t), y{t), 
and z{t) are spike trains represented as sum of delta functions, 
{x + y\z) — {x\z) + {y\z) (Fig. |5|3). Therefore, the functions 
built on this space have the same constraint; the function value 
for the superposition x{t) + y{t) is the sum of the function 
value for x{t) and y{t). Note that the binned spike train kernels 
share this property of linearity with respect to superposition. 
We will see that this limitation can be a critical weakness. 

D. Nonlinear functional kernels 

To unleash the full potential of kernel methods, we need 
binless nonlinear spike train kernels. There are several ways to 
extend the linear functional kernels to be nonlinear ll24l . Il30l . 
ESI, ll28l . lISTI . Here, we focus on building the Schoenberg 
kernel since it provides a provably universal kernel. Schoen- 
berg kernels are derived from the radial basis functions, and 
takes the following form, 

n{x,y) = (l){\\x-yf'^ (4) 

where the function : [0, 00) — > M is completely monotone 
on [0, 00) but not a constant function 1321 . Examples of 
completely monotone functions are e~"^, (^_)_^2)/3 where a 
and /3 are constants ||33]| . 

We take the functional norm derived from the linear func- 
tional kernel, that is, 

Ik - y\? ^ {x- y\x - y) = {x\x) - 2{x\y) + {y\y). 

Next, we build radial basis kernels on top of the feature space 
induced by the Unear functional kernel. Therefore, in Schoen- 
berg kernels, the underlying linear functional kernel provides 
the smoothness in the space, and the radial basis function (f) 
enforces the linear superposition to only hold locally. This 
combination guarantees the resulting kernel to be powerful 
for both neural identification and decoding applications. 

A typical choice is to use ([3]) as n' with (t){x) = e~"^ which 
results in the following form: 

k{x, y) = exp x) - 2k! {x, y) + ^'(y, y))^ . 

(5) 

This can be considered as an analogue of the widely used 
Gaussian kernel for Euclidean space. Schoenberg kernels are 
universal, in the sense that they can asymptotically approx- 
imate arbitrary nonlinear function from spike trains to reals. 
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They have an additional scale parameter a which controls how 
much smoothing is applied to the space induced by the base 
kernel. 

E. Extending single spike train kernels to multiple neurons 

So far we have introduced kernels that compute similarity 
between a pair of spike trains (either from a single neuron at 
different times or from a pair of neurons). However, recent 
recording techniques allow simultaneous recording of up to a 
couple of hundreds of neurons. Thus, we need kernels for 
a pair of sets of spike trains from many neurons. There 
are a couple of simple yet effective ways to extend single 
neuron kernels to multiple neuron kernels (see ||34l . ifTTll for 
combining kernels). First is to use a product kernel, 

i^{x,y) ^W_Hi{xi,yi) (6) 

i 

where i indexes over simultaneously recorded neurons. Second 
is to use a direct sum kernel, 

n{x,y) ^^aiK^{xi,yi), (7) 

i 

where are weights for combining the effect of each neu- 
ron. The product kernel is the natural inner product of the 
product Hilbert space, and the direct sum kernel is that of 
the direct sum Hilbert space. The product kernel preserves 
the universality of elementary kernels (e.g., with Schoenberg 
kernels), but if the effective dimension of the spike train 
manifold increases (as in the case of less dependent spike 
trains and/or independent noise processes) the number of spike 
trains required to "fill" the space increases for the same kernel 
size. Hence, more smoothness may have to be incorporated 
(imposed by kernel sizes), or exponentially more data may be 
required to estimate equivalently detailed nonlinear functions. 
The direct sum kernel does not preserve universality; in fact, 
only additive functions over multiple neurons are spanned 
by those kernels. Therefore, unless such constrains are of 
interest, it is not useful for general neuroscience applications. 
In general, combining kernels increases the number of hyper- 
parameters, making cross-validation less practical, hence we 
recommend empirical Bayes methods for their estimation ll35l . 
US). 

Although it is possible to form a product kernel from the 
spikernel, it is not necessary to do so because the spikernel 
can be extended directly for multiple neurons by considering 
a vector of spike counts for each time bin ||23]| . In such 
construction, the time warping is uniformly applied to all 
spike trains. Since the time complexity is only additive for the 
number of neurons, for a large population recording, spikernel 
could be computationally advantageous. 

III. Applications 

Equipped with spike train kernels, we can now discuss 
application areas in neuroscience and neural engineering, each 
of which requires a different class of kernel methods. We 
discuss the problem of hypothesis testing first, followed by 
stationary neural code analysis using regression and online 
neural decoding with adaptive filtering. 



A. Neuron identification 

Due to the trial-to-trial variability of neural responses 
(Fig. [3]l the collection of responses to repeated stimuli can be 
considered as realizations of a random process. When realiza- 
tions are spike trains, the corresponding mathematical object, 
the probability law over spike trains, is called a point process. 
It is often necessary to determine if two sets of responses 
given different experimental conditions are different — we want 
to know if the response carries any information about the 
experimental condition of interest. For example, some neurons 
in the visual cortex encode the stimulus color regardless of the 
motion, while some encode the directional motion regardless 
of the color. 

In practice, a severe bias may be unwillingly included when 
searching for a neuron that encodes information about a certain 
feature especially in the context of in vivo electrophysiology. 
In a typical setting, a trained electrophysiologist would listen 
to the firing pattern of each neuron and make a decision on 
the fly to record from the probed neuron, which tends to be 
the one with larger firing rate modulation. Identifying neural 
selectivity has been widely done assuming that the information 
is only represented in the mean statistics (firing rate). The 
conventional estimator is a histogram (or a smoothed version), 
known as the peri-stimulus time histogram (PSTH), which is 
obtained by binning the time, and averaging over trials. The 
difference in the mean statistics is then used for associated 
hypothesis testing 1371 . Il38l . 

Another widely used test is the Wilcoxon rank sum test 
on the spike counts ll39ll . It nonparametrically captures the 
difference in the distribution over counts, therefore, more than 
just the mean count is tested. Since the count distribution 
must span across multiple values to be meaningful, it requires 
a large window that captures relatively many spikes, and 
it is difficult to apply it to multiple bins. Thus, the count 
distribution cannot capture the differences in the temporal 
structure. For these reasons, these widely used parametric 
statistical tests are fundamentally incapable of discriminating 
many features. 

Instead, we submit that a class of nonparametric statistics 
is needed that can discriminate among many point process 
distributions, either in terms of higher order statistics or the 
temporal dimension. Such a statistic is known generally as 
a divergence ||40l, JSOl, f24l. One can define a diver- 

gence measure from kernels by utilizing a recent development 
known as probability embedding, which provides a general 
construction of divergence measures by representing the data's 
probability distribution as a point in the feature space |42|, 

ii, fflj, ma. 

The idea of probability embedding is very simple. Use a 
kernel to map the samples to the feature space, and take 
the mean to represent the (empirical) probability distribution. 
This is possible because the feature space is a (linear) Hilbert 
space. As long as one uses a powerful enough kernel, this 
simple process gives a unique representation of the empirical 
probability law that asymptotically converges to the true 
distribution. Technically, it is sufficient to show that the kernel 
K is strictly positive definite (spd) in the integral sense, that 
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Fig. 6. Illustration of maximum mean discrepancy (MMD) and hypothesis testing results. (Top row) Two sets of artificial spike trains shifted half a cycle 
is analyzed with tliree different kernels: spikernel, linear functional kernel |3j, and Schoenberg kernel j5|. The spike trains mapped in the feature space are 
visualized in two dimensional subspace determined by the kernel principal component analysis that preserves the most of MMD 1 24 1 . The distance between 
the mean in the feature space corresponds to MMD. In this example, all three kernels successfully rejected the null hypothesis, indicating the corresponding 
point processes are distinct. (Bottom row) Hypothesis testing performance on spike trains recorded from retinal ganglion neurons. We used the rat retinal 
ganglion cell recordings from Nirenberg's lab 1 15] (22 stimulus conditions repeated 30 times for 3.1 sec from 15 neurons). The task is to discriminate different 
stimulus conditions given the spike trains from a single neuron. For each neuron, we paired the stimulus conditions to have close median firing rate response 
by sorting the conditions and pairing neighbors for the test. Higher rejection rate implies a more practical divergence for the test. Dashed line is the test size 
(p = 0.1). Error bar indicate the standard error (165 tests). (Copyright 2012 MIT Press Journals. Modified from t24J with permission). 
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k(x, y)p{x)p{y) dxdy > 



(8) 



for any probability distribution p of consideration on the input 
space [24 1 . 

Interestingly, the mean of the point process in the Hilbert 
space for the binned or Unear functional kernel results in 
estimators for the firing rate function. The PSTH can be 
formed by using a binned spike train kernel, and a smoothed 
estimate of the intensity function can be produced by using 
a linear functional kernel. Given a collection of spike trains 
(with possible repeats), {xi}f, the mean in the Hilbert space 
corresponding to a linear functional kernel k can be repre- 
sented as a function over arbitrary spike train z, 



^ N ^ N Ni 



Since the kernel is linear for superposition, the mean does not 
depend on which spike train a particular spike came from. 
Therefore, the mean does not capture any statistical structure 
between spikes within the same trial. As can be expected. 



neither the binned kernel nor the linear functional kernels is 
spd HH. 

For spd kernels, the mean contains all information about 
the collection of spike trains. This is not surprising, given 
that unique spike trains mapped to the Hilbert space are 
not only unique, but are mutually linearly independent (the 
Gram matrix is full rank). The mean "remembers" the set of 
spike trains that formed it, except for the ordering. What is 
important is that the mean in the Hilbert space is a smoothed 
representation, and hence if the spike trains that consist the 
mean are similar, they are close in the Hilbert space. 

A divergence measure for empirical observations can be 
defined as the distance of the means for a pair of collections 
of observed spike trains in the Hilbert space, 

2 

1 NT^ , s 1 



({^Jf,{%}f) = 



K,(Xi, •) 
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where k is a spd kernel. This divergence statistic D is called 
maximum mean discrepancy (MMD) |42]. When MMD is 
large, it is an indication that the two point processes are 
different. In the classical hypothesis testing framework, we 
need the distribution of MMD under the null hypothesis which 
assumes that both collections originate from the same under- 
lying random process. We can generate MMD values from the 
null distribution by mixing the samples from both conditions 
and resampling from the mixture |42|, |24|. The following 
simple procedure describes a typical hypothesis testing given 
two collections of spike trains {xi)f 
test size a: 

1) Compute the kernel matrix K 

2) Compute = j^l^K(/,/)l 
^1TK(/,J)1 

Bootstrap randomly permuted indices of size N and M 
with replacement and recompute the statistic of the null 
distribution 

If is above the (1 — a) quantile of the bootstrapped 
null distribution, reject the null hypothesis, otherwise 
accept it. 

The smoothness of the probability embedding is controlled 
by the spike train kernel of choice, and hence it is important 
to choose a kernel that captures the natural similarity of spike 
trains well. This may come as a surprise since all spd kernel 
are asymptotically equivalent for MMD, that is, if the two 
underlying probability laws are different, any spd kernel can 
discriminate given a large enough sample. Yet, the small 
sample power of the divergence test is greatly enhanced by 
encoding more prior information of the similarity into the 
spike train kernel. 

B. Neural decoding 

Neural decoding searches for the detailed relationship be- 
tween neural activity and the variable of interest (Fig. |2|. 
Successful decoding analysis often provides evidence (or new 
hypothesis) for specific coding schemes the neural system 
uses, functionally identifies the system, and moreover, it can 
be used to develop neural prosthetics and interfaces. 

Depending on the modality of the target variable, neural 
decoding can be tackled by different tools. When the target 
variable is categorical (finite number of possibilities), classi- 
fication algorithms are suitable; e.g., object recognition and 
multiple choice decision are naturally categorical. If the target 
variable is continuous valued and fixed for each trial, but 
jumps from one value to another, then regression tools are 
appropriate. Such trial-based experimental design is commonly 
used for studying the neural code. When a continuous target 
variable is varied through time, filtering and smoothing al- 
gorithms are appropriate. Most primary sensory as well as 
motor features naturally fall in this category, and are of most 
practical use in neural engineering. Here we will focus our 
discussion on regression and filtering. By regression, we mean 
batch analysis, while by filtering, we refer to online (real-time) 
signal processing. 

A traditional approach to map single or multiple spike 
trains to a continuous target variable is linear regression 



Gaussian process prediction on test set 





input spike train 



Fig. 7. Regression example using Gaussian processes on a syntlietic problem. 
Gaussian process regression is used to estimate a target variable on a training 
set in a test set. Decoding traces with their credible intervals (2 standard 
deviation of the posterior around the mode) are plotted. The target variable 
is generated through a second-order Volterra system given by, j/(t) = 0.1 + 
k\(t — ti) + k2{ti, tj) (right). (Volterra kernels ki and k2 are not 
to be confused with the spike train kernel used for decoding.) 



on binned spike trains with relatively large bin sizes |46|, 
103, BSl . Again, the rational stems from the neuroscience 
literature which focuses primarily on the information carried 
by the mean firing rate, and little about the detailed temporal 
structure within each trial. Despite their crudeness, linear 
models on binned spike trains perform reasonably for motor 
brain machine interfaces, because the time scale of behavior 
is at the hundred milliseconds scale. 

For filtering, conventional linear filtering methods such 
as least mean squares, recursive least squares, and Kalman 
filters are often used, and recurrent neural network approach 
for filtering is also worth mentioning (49). In recent years, 
state-space based Bayesian filtering approaches have been 
popular- ISOll . lISTI . ||52]| . ||53l . A state-space (also known as 
latent-variable) model combined with an encoding model from 
continuous observation to spike trains is inverted using the 
Bayesian filtering framework. This method requires a good 
encoding model which has to be fit ahead of time, and is 
based on stationary assumptions during and between training 
and testing conditions. Because of neural plasticity, in practice 
frequent retraining or sophisticated tracking is needed. 

1) Gaussian process regression: Gaussian process (GP) 
regression is a widely used nonparametric Bayesian regres- 
sion method |35|. For neural decoding, we assume a prior 
distribution over the functions from spike trains to M. This 
prior is explicitly stated to be a Gaussian process with co- 
variance provided by a kernel; the covariance of the function 
values evaluated at two spike trains x and y is given by 
cov(/(a;), /(?/)) = k{x, y). To be specific, given a set of spike 
trains {xi]'^^^, the distribution of the function values from the 
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prior is multivariate Gaussian distributed 

[/(xi),...,/(x„)]'-AA(0,K) 

where K is the kernel matrix ([TJ. 

Using the GP prior, we can infer the posterior assuming a 
Gaussian noise corrupted observation model. The prediction 
of the function evaluated at spike train z is given by, 

n 

f{z) ^^aiK(xi,z) (10) 

i 

where a = {'K + af^I)^^y, where cr^ is the observation noise 
variance and y is the desired signal vector corresponding to 
training data {xi}"^^. 

There are several advantages of GP: 

1) the prediction coincides with kernel ridge regression 
(regularized kernel least squares), but GP provides the 
posterior credible interval (not to be confused with the 
frequentist confidence interval) which indicates the un- 
certainty of the prediction under the model assumptions, 

2) given a universal kernel, it can learn any nonlinear 
functional relation, and, 

3) hyperparameters (kernel parameters and observation 
noise variance) can be tuned in a principled way using 
empirical Bayes procedure. 

In figure [7] we compare GP regression with linear func- 
tional, Schoenberg, and spikernel in a synthetic example where 
a Poisson spike train is mapped to a real-valued signal through 
a second order Volterra system. The hyperparameters are 
learned through empirical Bayes method where the marginal 
likelihood is maximized on the training set (400 points). The 
linear functional kernel of ^ does not perform well on 
mean prediction (red trace) because of the strong nonlinear 
component (pairwise interaction of spike timings due to the 
second order Volterra kernel), while the spikernel obtains a 
reasonable prediction, and Schoenberg kernel of (|5]l achieves 
very high performance. The credible interval resulting from 
using the Schoenberg kernel is the smallest, meaning the 
model is confident that the data is well described by the 
regression result. In contrast, the inferred credible interval for 
the spikernel is large, meaning at least some aspects of the 
data are not well described by the fit model. 

2) Kernel adaptive filtering: For closed loop applications, 
the system identification and prediction benefit from sequential 
processing where the system parameters are adapted with 
every new sample because neural systems are plastic and 
there are real time constraints in the experimental setup. 
Therefore, adaptive filtering algorithms have been widely used 
in the brain-machine interface applications, and other neural 
prosthetics |6|. As stated, linear filtering algorithms such as 
least mean squares (LMS) and recursive least squares (RLS) 
algorithms as well as Kalman filtering have been successful us- 
ing the binned representation, but performance improvements 
are still needed. 

Kernel adaptive filters (KAF) have been recently developed 
that kernelize the linear adaptive filtering algorithms |54|, 
inheriting their simple computational structure, and extending 
them to nonlinear transfer functions. Similarly, KAFs operate 



on a sample-by-sample basis, and can deal with non-stationary 
environments. 

Here we describe the simplest yet powerful kernel least 
mean squares (KLMS) algorithm, which has been successfully 
applied in neural engineering as inverse control in the context 
of somatosensory prosthetics [56 1. KLMS is a nonlinear ex- 
tension of the popular LMS algorithm that can be derived as 
a stochastic gradient descent on the mean squared error but 
with an infinite dimensional weight vector The filter mapping 
is a function of the same form as ( [TO| i, where the a's are 
proportional to the instantaneous errors: 

O^i ^ -viVi - fi-i{xi)) (11) 

where fi^i is the estimated filter before the i-th observation, 
and \ri\ < 1 is the learning rate |54|. Thus, the filter input- 
output map is fully represented by pairs of real coefficient 
ai and observed spike train Xi. The KLMS has been applied 
with advantages in nonlinear signal processing of continuous 
amplitude signals, mostly using the Gaussian kernel, but one 
of the advantages of the RKHS approach is that the algorithm 
formulation is independent of the kernel. Therefore, any of the 
spike train kernels presented in this paper can be directly used 
in the KLMS. 

We demonstrate online inverse modeling using KLMS in 
figure |8] 1551 . If57l . The goal is to reconstruct properties of 
the induced tactile stimulation (here the derivative of the 
force) from the generated multi-channel spikes in the thalamus 
(VPL) and primary sensory areas (SI) upon stimulation. Given 
the finite memory in the neural system, typically a moving 
window is usedfor example, of 80 milliseconds slided 2 ms at 
a time over the spike data. We use the product kernel in this 
MISO KLMS model, train a decoder on the tactile stimulus 
time series on a training set, and compare reconstruction 
performance with different spike train kernels on a test set. 
Overall, a similar trend to previous examples is observed 
in this application where the Schoenberg kernel outperforms 
the spikernel and the linear kernel, both in terms of faster 
convergence in the training set and better reconstruction of 
the stimulus in the test set. 

IV. Discussion 

Spike train kernels enable signal processing and machine 
learning of spike trains by providing a feature space for 
computation. In this article, we surveyed how positive definite 
functions can be used to quantify spike trains, at the rate or 
spike timing level, and implement important operations for 
neural decoding such as hypothesis testing using probability 
embedding, and estimation of continuous functional mappings 
from spike trains. As we have briefly demonstrated, spike 
train kernels provide a unifying framework for most of the 
techniques used in spike train processing, from the statisti- 
cal rate descriptors, to the binned representations up to the 
full description (injective mapping by a spd kernel) of the 
spike train timing structure in infinitely dimensional functional 
spaces. The approach is therefore versatile and mathematically 
principled, extending the footprint of signal processing and 
machine learning to more abstract spaces. 
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simultaneously recorded spike trains from 19 neurons 




online learning step (samples) 



Fig. 8. Decoding tactile stimulus from awake rat thalamus and somatosensory spike trains using kernel least means square (KLMS) algorithm. 80 ms 
sliding window over the spike trains is used to decode stimulus sampled at 100 Hz. Hyperparameters (kernel size and learning rate) are optimized for best 
performance on a cross-validation data (1 sec). First derivative of the force applied to the finger is the target variable, and 19 neurons. Learning curve shows 
the convergence of online learning (12 seconds of training is shown) quantified on a test dataset of 2 seconds. (Adapted from [55]) 



Among the spike train kernels, we have promoted the 
binless Schoenberg kernel, since, (1) it provides an injective 
mapping, (2) it can embed arbitrary stochasticity of neural 
responses as the sample mean in the RKHS, and (3) it is 
a universal kernel that can approximate arbitrary functions 
on spike trains. Such arbitrarinesses in the point process 
and function classes are allowed because of the strongly 
nonparametric nature (spd) of the kernel mapping. 

A. Interpretation of decoding analysis 

Like other advanced signal processing methodologies (e.g., 
deep learning and nonparametric Bayesian methods), strictly 
positive definite spike train kernels make the results less 
interpretable due to the high dimensionality of the implicit 
feature space. Weaker kernels only encapsulate explicitly cer- 
tain designed features, for instance, the count kernel is only 
sensitive to the total spike count, and the linear functional 
kernel is mostly sensitive to firing rate profiles. Although 
the stronger kernels can capture arbitrary features, they are 
not unique. Therefore designing explicitly stronger spike train 
kernels is non-trivial, because it is hard to understand what 
spike train features they are emphasizing. There are several 
ways we can partially recover the intuition, although more 
research is needed in this direction: via visualization of spike 
trains in the feature space in the case of MMD-PCA (Fig. |6|, 
via sparse regression methods like relevant vector machine, or 
via kernel selection over a set of strongly interpretable weaker 
(more parametric) kernels. 



Another caveat of decoding analysis is that successful de- 
coding does not imply that the brain is using the information, it 
only signifies that the information exists in the collected neural 
signals. For example, in early sensory neurons like retinal 
ganglion cells or auditory fiber, we can often decode and 
reconstruct the sensory stimulus better than what the subject 
can report by behavior. Therefore, we should be cautious not 
to over-interpret successful decoding. 

B. Future directions 

The last decade has been productive in terms of new kernels 
for many abstract (non-Euclidean) spaces, but there is still 
room for improvement. We would like to have a spike train 
kernel that is powerful enough, i.e., spd and universal, while at 
the same time, able to capture our prior knowledge about the 
similarity between spike trains. The variability of the neural 
system could provide hints to designing better spike train 
kernels. 

There are three practical aspects of designing a useful 
kernel: (1) the kernel should encode the prior knowledge of 
the input domain, and the problem at hand, (2) the kernel 
should be computationally tractable, perhaps with linear or 
less time complexity in the number of spikes; and (3) the 
kernel should have no or very few parameters — in the latter 
case there should be simple intuition behind the parameters, 
and more importantly simple rules for setting their values. 

We have discussed two frameworks in this article, namely 
the binned kernels, and the functional kernels. Binned kernels 
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are either too simplistic, they ignore the temporal structure, or 
computationally too expensive, e.g., spikemel. On the other 
hand, some functional kernels are either overly sensitive to 
the mean rate, such as linear functional kernel, or involve 
parameters that are not easily visuahzed. A kernel with the 
right balance among these three properties remains to be 
found. It is safe to assume that we have only scratched the 
surface of this problem, and there remain many open avenues 
to be explored. Two possible approaches are edit kernels, 
and generative kernels. Edit kernels rely on the principle of 
adopting simple operations such as shifting, addition, and 
deletion to convert one spike train into another where each 
operation is assigned a cost, whereas generative kernels rely on 
the principle that two spike trains are similar if they originate 
from the same generative model. 
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